Directional Pruning of Deep Neural Networks

Neural Information Processing Systems

Motivated by the fact that stochastic gradient descent (SGD) often finds a flat minimum valley in the training loss, we propose a novel directional pruning method that searches for a sparse minimizer in or close to that flat region. The proposed method requires neither retraining nor expert knowledge of the target sparsity level. To overcome the prohibitive cost of estimating the flat directions, we propose a carefully tuned $\ell_1$ proximal gradient algorithm which provably achieves directional pruning with a small learning rate after sufficient training. Empirically, our solution is among the best-performing of many existing pruning methods in the highly sparse regime (92% sparsity) on ResNet50 with ImageNet, while requiring only slightly more wall time and memory than SGD. Using VGG16 and the wide ResNet 28x10 on CIFAR-10 and CIFAR-100, we demonstrate that our solution reaches the same minimum valley as SGD, and that the minima found by our solution and by SGD do not deviate in directions that impact the training loss.
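To make the $\ell_1$ proximal gradient idea concrete, here is a minimal generic sketch of an ISTA-style step (not the paper's exact tuned algorithm): a gradient step followed by soft-thresholding, which zeroes small coordinates and thereby produces sparsity without a separate pruning pass.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_step(w, grad, lr, lam):
    """One l1 proximal gradient step: gradient descent followed by shrinkage.

    Coordinates whose magnitude falls below lr * lam after the gradient step
    are set exactly to zero.
    """
    return soft_threshold(w - lr * grad, lr * lam)

# toy quadratic loss f(w) = 0.5 * ||w - t||^2 with target t
t = np.array([2.0, 0.05, -1.5, 0.01])
w = np.zeros_like(t)
for _ in range(200):
    grad = w - t
    w = prox_grad_step(w, grad, lr=0.1, lam=0.5)
print(w)  # the small coordinates of t are pruned to exactly 0
```

On this toy problem the iterates converge to the soft-thresholded target, so the two near-zero coordinates end up exactly zero while the large ones shrink by the regularization amount.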



Review for NeurIPS paper: Directional Pruning of Deep Neural Networks

Neural Information Processing Systems

Additional Feedback: My overall sense about this paper is that there is an interesting result here that would be significantly improved if the relationship to OBS were clarified, if \mathcal{P}_0 were clarified, and if the empirical results were stronger. In particular, on the last point, given that the method does not seem to work particularly well with standard hyperparameters, I am less enthusiastic about directional pruning as a valuable pruning definition, even though it seems natural. The results presented in the main body of the paper, obtained with non-standard hyperparameters and reduced accuracy for the initial network, give me pause as well, so perhaps the methodology of these experiments could also be improved. An alternative narrative that would make for a stronger result -- if true -- would be to map the OBS objective to solutions of this algorithm; in that case a reader need not be concerned about whether directional pruning itself is a valuable concept, as OBS is already well established.


Review for NeurIPS paper: Directional Pruning of Deep Neural Networks

Neural Information Processing Systems

Thank you for your submission. There was much internal discussion about the paper. R3 championed the paper and appreciated that the method has theoretical footing. R1 and R2 raised critical issues with the empirical evaluation. R1 correctly highlighted that the experiments do not include important baselines. Additionally, the evaluation was done on nonstandard learning rate schedules, and the results on the standard learning rate schedule are not fully convincing (the author feedback did not resolve this issue).



Structured Directional Pruning via Perturbation Orthogonal Projection

Yinchuan Li, Xiaofeng Liu, Yunfeng Shao, Qing Wang, Yanhui Geng

arXiv.org Machine Learning

Structured pruning is an effective compression technique for reducing the computation of neural networks; it is usually achieved by adding perturbations that remove network parameters at the cost of a slight increase in training loss. A more reasonable approach is to find a sparse minimizer along the flat minimum valley found by optimizers such as stochastic gradient descent, which keeps the training loss constant. To achieve this goal, we propose structured directional pruning, based on orthogonally projecting the perturbations onto the flat minimum valley. We also propose a fast solver, sDprun, and further prove that it achieves directional pruning asymptotically after sufficient training. Experiments using VGG-Net and ResNet on the CIFAR-10 and CIFAR-100 datasets show that our method obtains state-of-the-art pruned accuracy (93.97% on the VGG16/CIFAR-10 task) without retraining. Experiments using a DNN, VGG-Net and WRN28x10 on the MNIST, CIFAR-10 and CIFAR-100 datasets demonstrate that our method performs structured directional pruning, reaching the same minimum valley as the optimizer.
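The projection step can be sketched as follows, assuming the flat directions are given as an orthonormal basis (in practice they correspond to near-zero-curvature directions of the loss; this toy example hands the basis in directly rather than estimating it):

```python
import numpy as np

def project_onto_flat_subspace(delta, flat_basis):
    """Orthogonally project a pruning perturbation onto the flat subspace.

    flat_basis: (d, k) matrix whose orthonormal columns span directions of
    (near-)zero curvature of the training loss; moving along them leaves the
    loss approximately unchanged to second order.
    """
    return flat_basis @ (flat_basis.T @ delta)

# toy 3-d example: the flat valley is the x-y plane, curvature lies along z
flat_basis = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [0.0, 0.0]])
delta = np.array([0.3, -0.2, 0.7])   # raw perturbation that would zero some weights
safe_delta = project_onto_flat_subspace(delta, flat_basis)
print(safe_delta)  # -> [0.3, -0.2, 0.0]: the loss-increasing z-component is removed
```

The projection discards exactly the component of the perturbation that would climb out of the valley, which is what keeps the training loss (approximately) constant.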


Directional Pruning of Deep Neural Networks

Shih-Kang Chao, Zhanyu Wang, Yue Xing, Guang Cheng

arXiv.org Machine Learning

Motivated by the fact that stochastic gradient descent (SGD) often finds a flat minimum valley in the training loss, we propose a novel directional pruning method that searches for a sparse minimizer in or close to that flat region. The proposed method requires neither retraining nor expert knowledge of the target sparsity level. To overcome the prohibitive cost of estimating the flat directions, we propose a carefully tuned $\ell_1$ proximal gradient algorithm which provably achieves directional pruning with a small learning rate after sufficient training. Empirically, our solution is among the best-performing of many existing pruning methods in the highly sparse regime (92% sparsity) on ResNet50 with ImageNet, while requiring only slightly more wall time and memory than SGD. Using VGG16 and the wide ResNet 28x10 on CIFAR-10 and CIFAR-100, we demonstrate that our solution reaches the same minimum valley as SGD, and that the minima found by our solution and by SGD do not deviate in directions that impact the training loss. The code that reproduces the results of this paper is available at https://github.com/donlan2710/gRDA-Optimizer/tree/master/directional_pruning.
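The repository name suggests the solver is a gRDA (generalized regularized dual averaging) optimizer. Below is a rough NumPy sketch of one gRDA-style step; the threshold schedule g(n, lr) = c * sqrt(lr) * (n * lr)**mu and the default constants are assumptions for illustration, not a verified reimplementation of the released code.

```python
import numpy as np

def grda_step(v, grad, n, lr, c=0.005, mu=0.51):
    """One gRDA-style update step (sketch under assumed schedule).

    v accumulates plain SGD information (initialize it at the initial
    weights); the weights w are its soft-thresholded image, so sparsity
    grows as the threshold g increases with the step count n.
    """
    v = v - lr * grad                      # dual (gradient-accumulation) update
    g = c * lr ** 0.5 * (n * lr) ** mu     # slowly growing l1 threshold (assumed)
    w = np.sign(v) * np.maximum(np.abs(v) - g, 0.0)
    return v, w

# toy usage: with a zero gradient the update only applies the growing threshold
v = np.array([1.0, 0.001])                 # stand-in for accumulated weights
v, w = grda_step(v, np.zeros(2), n=1000, lr=0.1)
print(w)  # the tiny coordinate is thresholded to exactly 0
```

Because the threshold grows with n while v tracks an SGD trajectory, small coordinates are progressively pinned to zero, which matches the abstract's claim of pruning without a separate retraining phase.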